Abstract



When normality does not hold, nonparametric tests represent an important data-analytic alternative to parametric tests. 
However, the use of nonparametric tests in educational research has been limited by the absence of easily performed tests for 
complex experimental designs and analyses, such as factorial designs and multiple regression analyses, and limited 
information about the properties of these tests for realistic data conditions. Efforts to remedy this deficiency have begun 
with the introduction of general linear model-based nonparametric tests. The results of a computer simulation of the 
properties of several of these tests in hierarchical regression analysis indicated that, on balance, the top-performing test for the
nonnormal distributions studied was the McKean-Hettmansperger F-test using a confidence interval estimate of the scale
parameter $\tau$. The Serlin-Harwell aligned-rank chi-square test performed almost as well, which, combined with the fact that
it is easier to compute, makes it an attractive competitor to the McKean-Hettmansperger test. 

Educational researchers commonly examine their data for evidence of problems, nonnormality for example, that can
threaten statistical conclusion validity. When scores are independently and identically (normally) distributed with common
variance $\sigma^2$, parametric tests are optimal for testing general linear model-based hypotheses. When normality does not hold,
nonparametric tests represent an important data-analytic alternative to parametric tests (e.g., Lehmann, 1975, pp. 171-175; 
Marascuilo & McSweeney, 1977, p. 89; Zimmerman & Zumbo, 1993). (Other criteria for distinguishing between parametric 
and nonparametric tests have been formulated, see, e.g., Kendall & Stuart, 1979, pp. 497-498; Marascuilo & McSweeney, 
1977, pp. 3-6). Comparisons of parametric and nonparametric estimators and tests under realistic data conditions, such as 
small sample sizes and nonnormal data, have spawned a considerable research literature. Despite its size, this literature is 
rather narrow in that its focus has been on relatively simple experimental designs and analyses (e.g., two groups). 

Serlin and Harwell (2001) argued that nonparametric methods are under-used in educational research, in part because 
educational researchers are not aware of nonparametric tests that are available for complex experimental designs and 
analyses, such as factorial designs and multiple regression analyses. They also suggested that many educational researchers 
are not aware that such analyses can be performed using existing computer programs, such as SPSS (SPSS Inc., 1999) or 
Minitab (Minitab Inc., 2000). Serlin and Harwell concluded that the development of general linear model-based 
nonparametric procedures holds great promise for increasing the use of these nonparametric methods in educational 
research. They also pointed out that little is known about the behavior of these tests for realistic data conditions. 

Serlin and Harwell indicated that three general linear model-based nonparametric procedures are especially promising. 
The aligned-rank procedure of Puri and Sen (1971, 1985) tests hypotheses about parameters of interest after eliminating
so-called nuisance parameters. This involves aligning the raw scores using estimates of the nuisance parameters, ranking the
aligned values (residuals), and computing a test statistic that follows a chi-square distribution. The rank-based procedure of
McKean and Hettmansperger (1976) and Hettmansperger and McKean (1977) involves comparing two models: (i) a reduced
model containing nuisance parameters is fitted to the raw data and the residuals are obtained and ranked; (ii) the ranked and
unranked residuals are used to compute a measure of dispersion; (iii) steps (i) and (ii) are repeated for a full model containing
the nuisance parameters and the parameters of interest; and (iv) a test statistic is computed based on the difference in the
dispersion measures for the reduced and full models. Another promising procedure is the aligned-rank-transform described
in Fawcett and Salter (1984), in which the rank-transform procedure of Conover and Iman (1976) is applied to data that have 
been aligned for nuisance parameters. Fawcett and Salter did not consider the aligned-rank-transform in a general linear 
model context, but doing so greatly extends the use of this method. 

Theoretically, these procedures allow educational researchers to perform nonparametric tests for data obtained from 
complex experimental designs and data-analytic models using existing data analysis software. Estimators and tests 
associated with the Puri and Sen (1971, 1985) and McKean and Hettmansperger (1976) procedures have similar or identical 
properties asymptotically, and there is some evidence that the asymptotic properties of the aligned-rank-transform are similar 
to those of Puri and Sen and McKean and Hettmansperger. In any event, the literature on their performance for the
less-than-asymptotic case is quite sparse. For example, available evidence that aligned-rank tests are excellent competitors to
their parametric counterparts for controlling Type I errors at $\alpha$ and showing good statistical power (e.g., Adichie, 1978;
Gorham, 1998; Puri & Sen, 1985) is limited to a small number of designs (e.g., randomized-block) and data analyses. 
Similarly, there is evidence that the McKean and Hettmansperger procedure performs reasonably well except when sample 
sizes are small, but available work has focused almost exclusively on factorial models and a few distributions. There has 
been even less study of the aligned-rank-transform procedure. In short, there is a substantial gap in the nonparametric
literature on the behavior of these nonparametric tests for realistic data conditions. This paper reports the results of a
computer simulation study of their behavior. 

We first describe the three procedures and available theoretical and empirical evidence of their behavior. Next we 
introduce additional tests suggested by these procedures, and then report the results of a computer simulation study that 
investigated the behavior of the tests. We conclude by describing areas in which additional research is needed. 

Puri and Sen’s Aligned-Rank Test 

Puri and Sen (1985, pp. 238-287) described a general linear model-based aligned-rank procedure, originally 
introduced by Mehra and Sarangi (1967) for main effects in additive models and extended by Sen (1968), that has several 
desirable properties. This procedure, which has its roots in the work of Hodges and Lehmann (1962), assumes that 
$F_i(y) = F(y_i - B_0 - \mathbf{B}'(\mathbf{x}_i - \bar{\mathbf{x}})), \quad i = 1, 2, \ldots, N$ (1)

underlies the data, where $F_i(y)$ is a continuous distribution function for the ith subject, y represents the dependent variable,
$B_0$ is an intercept, $\mathbf{B}$ is a q x 1 vector of partial regression parameters consisting of $q_1$ nuisance parameters in the vector $\mathbf{B}_1$
and $q_2$ parameters of interest in $\mathbf{B}_2$ ($\mathbf{B} = (\mathbf{B}_1, \mathbf{B}_2)$, $q = q_1 + q_2$), $\mathbf{x}_i$ is a q x 1 vector of fixed and known predictor values for the ith
subject, and $\bar{\mathbf{x}}$ is a q x 1 vector of means for the predictor variables.

The Puri and Sen model applies equally to data fitting the correlation model

$F_i(y \mid \mathbf{x}) = F(y_i - B_0 - \mathbf{B}'(\mathbf{x}_i - \bar{\mathbf{x}}))$ (2)

in which the predictors are random and $F_i(y \mid \mathbf{x})$ represents the conditional distribution of y given x (Sampson, 1974). The
assumption of continuity of the $F_i$ theoretically eliminates the problem of tied scores, but when these occur conventional
practice is to assign midranks. As long as the proportion of ties in the raw data is relatively small the midranks will have a
negligible effect on the test (Lehmann, 1975, p. 18).

To test the hypothesis $H_0: \mathbf{B}_2 = \mathbf{0}$, a regression of y on the $q_1$ predictors associated with the nuisance parameters is
performed and the resulting residuals are computed using $y_i - \mathbf{x}_i'\mathbf{b}_1$, where $\mathbf{b}_1$ is a vector of estimated slopes. Either ordinary
least squares (OLS) parameter estimates or rank estimates can be used to generate these residuals. Aubuchon and
Hettmansperger (1984) pointed out that the impact of OLS versus rank estimates on tests in small samples is unknown, but
most authors (e.g., Hettmansperger & McKean, 1998; Puri & Sen, 1985; Adichie, 1984) employ OLS estimates.

Under model assumptions the residuals are free of the effects of the $q_1$ nuisance parameters. They are then ranked
(say, $R_i$). Next, the linear rank statistics $L_k$ (Puri & Sen, 1985, p. 247) are computed for each predictor/dependent variable
pairing for the $q_2$ predictors associated with the regression parameters of interest:

$L_k = \sum_{i=1}^{N} (x_{ik} - \bar{x}_k) R_i$ (3)

Using the original $x_k$ values in equation (3) is an application of what Puri and Sen call the mixed-rank model; ranking the $x_k$
and using these ranks in the analysis is an application of the pure-rank model. These models lead to estimators and tests with
the same properties, and we focus on the mixed-rank model.



It is clear in equation (3) that the $L_k$ are proportional to the slopes, and that centering the predictors produces a
nonparametric analogue of the least squares normal equations. This proportionality means that researchers do not need to
compute the $L_k$ and can instead simply compute slopes with the usual OLS expressions. The $L_k$ (or slopes) are highly
efficient compared to the usual least-squares estimators for a normal distribution and are asymptotically normal, clearing the
way for an omnibus test statistic that follows a chi-square distribution (Puri & Sen, 1985, chpt. 7). Another advantage of
the $L_k$ is that they are robust compared to the usual least squares estimators that minimize $(\mathbf{y} - \mathbf{X}\mathbf{b})'(\mathbf{y} - \mathbf{X}\mathbf{b})$, because
the effect of outliers enters in a linear rather than a quadratic fashion (Draper, 1988).

Puri and Sen (1985, p. 247) proposed an aligned-rank test (PSAR) based on the $L_k$ and their asymptotic variances that
can be written in the form

$\mathrm{PSAR} = (N - 1)\hat{\theta}$ (4)

where $\hat{\theta}$ represents a measure of explained variation. The PSAR test is asymptotically distribution-free and is
asymptotically distributed as a chi-square variable with $q_2$ degrees of freedom. (The PSAR test is identical to the test
proposed by Adichie (1978) for the single predictor case.) In a regression model $\hat{\theta}$ equals SSRegression/SSTotal, where
SSRegression = $\mathbf{b}_2'\,\mathbf{ssx}_2\,\mathbf{b}_2$, $\mathbf{b}_2$ is a $q_2$ x 1 vector of estimated slopes for the parameters of interest, $\mathbf{ssx}_2$ is a $q_2$ x $q_2$ sum of
cross-products matrix for the predictors associated with the parameters of interest, and SSTotal = $\sum_{i=1}^{N} (R_i - \bar{R})^2$, where $\bar{R}$ is the
overall mean of the ranks. As illustrated in Serlin and Harwell (2001), the PSAR test can easily be computed using existing
software.
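
The align, rank, and regress sequence described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' program: the function names `midranks` and `psar_statistic` are hypothetical, alignment uses OLS, ties receive midranks, and the mixed-rank model (unranked predictors) is assumed. The function returns the statistic and its degrees of freedom; the comparison with a chi-square critical value is left to the reader.

```python
import numpy as np

def midranks(values):
    """Rank values from 1..N, giving tied values the mean of their ranks."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(values.size)
    ranks[order] = np.arange(1, values.size + 1)
    for v in np.unique(values):
        tied = values == v
        ranks[tied] = ranks[tied].mean()
    return ranks

def psar_statistic(y, x_nuisance, x_interest):
    """Return (PSAR, q2): (N - 1) * SSRegression / SSTotal and its df."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    x1 = np.asarray(x_nuisance, dtype=float).reshape(n, -1)
    x2 = np.asarray(x_interest, dtype=float).reshape(n, -1)
    # Step 1: align y by removing the OLS fit on the nuisance predictors.
    design = np.column_stack([np.ones(n), x1])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    # Step 2: rank the aligned values (residuals).
    r = midranks(resid)
    r_centered = r - r.mean()
    # Step 3: slopes of the ranks on the centered predictors of interest.
    x2c = x2 - x2.mean(axis=0)
    ssx2 = x2c.T @ x2c
    b2 = np.linalg.solve(ssx2, x2c.T @ r_centered)
    ss_regression = b2 @ ssx2 @ b2
    ss_total = r_centered @ r_centered
    return (n - 1) * ss_regression / ss_total, x2.shape[1]
```

With two predictors of interest, for example, the returned statistic would be compared with the .05 chi-square critical value for 2 degrees of freedom (about 5.99).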

The assumptions underlying the PSAR test are that the $y_i$ are continuous, independently and identically distributed,
and that the sample size is large enough to ensure the validity of probabilistic inference based on a chi-square distribution.
Under the assumption that the $y_i$ follow a normal distribution, the A.R.E. of the PSAR test using ranks compared to the
normal-theory likelihood ratio test is approximately .96, and the A.R.E. of the PSAR test for normally-distributed data with a 
normal-scores transformation compared to the likelihood ratio test is one (Puri & Sen, 1985, pp. 251-252). 

Although contrasts for the $B_k$ are available (Puri & Sen, 1985, chpt. 6), the PSAR procedure does not permit
model-checking that is common in regression (e.g., studentized statistics). Aligned-rank procedures like PSAR have also been
criticized on the grounds that their performance is suboptimal in certain settings. For example, Hettmansperger and McKean
(1983) described computer simulation results for a test of parallel slopes in which an aligned-rank test showed inflated Type
I error rates.

Research for the PSAR Test 

Studies of the PSAR test have overwhelmingly focused on the factorial case. Akritas (1990) provided some A.R.E. 
comparisons between a rank-transform test and the PSAR test for a completely between-subjects factorial design. These 
results demonstrated that the A.R.E. of the usual rank-transform F-test was higher than the PSAR test for a normal 
distribution, but the A.R.E. of PSAR exceeded that of the rank-transform F-test for logistic and double exponential 
distributions. However, Brunner and Dette (1992) argued that the rank transform test considered by Akritas (1990) was not 
really a rank-transform test, but instead was a test in which the ranks are divided by an estimated standard deviation and
then substituted into the usual F-test. 

Harwell (1991) used a simulation study to examine the behavior of the PSAR test. For a 3 x 2 design with various cell 
sizes and distributions, Harwell found that as long as cell sample sizes were at least 8 the PSAR test controlled its Type I 
error rate and showed good power compared to the F-test and some nonparametric competitors; for smaller cell sample sizes 
the Type I error rates were frequently inflated. Toothaker and Newman (1994) also used a factorial design and found that 
the PSAR test had inflated Type I error rates for cell sample sizes of 5; for larger sample sizes the test controlled its error
rate at the nominal level. Other simulation studies investigating the PSAR test include Conover and Iman (1976), Harwell 
and Serlin (1994), McSweeney (1967), Salter and Fawcett (1993), and Yohai and Ferretti (1987). The general result of these 
studies is that the Type I error rate differed noticeably from the nominal value for small sample sizes, for example, 3-5 cases 
per cell in a factorial design, with the test sometimes performing better for certain distributions. For larger sample sizes, the 
PSAR test generally did a good job of controlling its Type I error rate and showed good power. 

McKean and Hettmansperger’s Rank-Based Test

Building on the work of Jaeckel (1972), McKean and Hettmansperger (1976, 1977) described a two-step modeling
procedure in which a sum of products of the ranked and unranked residuals for a reduced model containing nuisance
parameters is compared to a sum of products of the ranked and unranked residuals for a full model containing the
parameters of interest plus the nuisance parameters. The McKean-Hettmansperger method is similar to the usual
least-squares model-fitting procedure, and supports model-checking procedures and contrasts for the $B_k$ (McKean, Sheather, &
Hettmansperger, 1990).

The test is based on a reduction in residual dispersion (RD) assuming that the model in equation (1) or (2) underlies
the data. McKean and Hettmansperger defined the linear rank statistic

$D(\mathbf{y} - \mathbf{X}\mathbf{b}) = \sum_{i=1}^{N} (y_i - \mathbf{x}_i'\mathbf{b})\,[R_i^{W}(y_i - \mathbf{x}_i'\mathbf{b})]$, (5)

where $R_i^{W}$ represents Wilcoxon scores generated from the score function $\varphi = (12)^{1/2}(R_i/(N+1) - 1/2)$. Use of the Wilcoxon
scores ensures that $D(\mathbf{y} - \mathbf{X}\mathbf{b})$ in equation (5) does not depend on the intercept (Draper, 1988). Hettmansperger and
McKean (1998, p. 163) showed that the statistic in equation (5) is asymptotically unbiased, and if the population distribution
is symmetric it is unbiased for any sample size. They also claimed that their statistic was resistant to the effects of outliers.
However, Puri and Sen (1985, p. 282) noted that the statistic in equation (5) is more susceptible to outliers than the statistic
in equation (2) because of the use of the residuals $(y_i - \mathbf{x}_i'\mathbf{b})$.
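
As an illustration of the dispersion statistic in equation (5), the sketch below evaluates it for a given vector of trial slopes. The function names `wilcoxon_scores` and `dispersion` are hypothetical, and the code assumes there are no tied residuals; it does not perform the iterative minimization over the slopes.

```python
import numpy as np

def wilcoxon_scores(resid):
    """Wilcoxon scores sqrt(12) * (R_i / (N + 1) - 1/2); assumes no ties."""
    resid = np.asarray(resid, dtype=float)
    n = resid.size
    ranks = np.argsort(np.argsort(resid)) + 1
    return np.sqrt(12.0) * (ranks / (n + 1) - 0.5)

def dispersion(y, x, b):
    """Sum of products of the residuals and their Wilcoxon scores."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    resid = y - x @ np.asarray(b, dtype=float)
    return float(np.sum(resid * wilcoxon_scores(resid)))
```

Because the Wilcoxon scores sum to zero, adding a constant to every residual leaves the value unchanged, which is why the statistic does not depend on the intercept.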

Solving equation (5) for the slopes produces a rank analogue of the normal equations, which do not have a closed-form
solution except in the case of q = 1. Thus, estimating the slopes requires an iterative technique and specialized software
(Hettmansperger & McKean, 1998, pp. 184-189). Draper (1988) pointed out that the estimated slopes are not necessarily
unique because they may reflect one of several minima, but indicated that experience has suggested this is rarely a problem.
The result is a fitted rank-based regression model with q slopes that are highly efficient compared to the usual least-squares
estimators for a normal distribution.

The RREGRESS command in Minitab (Minitab Inc., 2000) will estimate the slopes that minimize the expression in
equation (5), as will the interactive RANOVA program maintained by J. McKean at the web address
http://www.stat.wmich.edu/slab/RGLM/index.htm. These slopes have an approximate distribution of $N(B_k, \tau^2(\mathbf{X}'\mathbf{X})^{-1})$,
permitting confidence intervals to be constructed about the $B_k$ (Hettmansperger & McKean, 1998, p. 189). If a sample
intercept is desired, Aubuchon and Hettmansperger (1984) recommended using the median of the residuals, although if the
distribution of residuals is assumed to be symmetric the median can be estimated using ranks along with the slopes (McKean
& Hettmansperger, 1978).

To adapt the statistic in equation (5) to test hypotheses McKean and Hettmansperger (1976) employed the following
strategy: Suppose we wish to examine the contribution of $x_2$ to explaining variation in y after taking into account the
contribution of $x_1$. The McKean-Hettmansperger procedure begins by fitting a reduced model of the form $\hat{y}_i = b_{01} + x_{i1}b_1$ to
the $y_i$, where $b_{01}$ is an estimated intercept and $\hat{y}_i$ represents fitted values. Then $D(\mathbf{y} - \mathbf{X}\mathbf{b})$ in equation (5) is computed for
the reduced model. Next a full model of the form $\hat{y}_i = b_{02} + x_{i1}b_1 + x_{i2}b_2$ is fitted to the data and the statistic in equation (5) is
computed a second time.

McKean and Hettmansperger (1976) suggested that the rank-based statistic,

$\mathrm{MHCHI} = RD/(\hat{\tau}/2)$ (6)

be used to test $H_0: \mathbf{B}_2 = \mathbf{0}$. In equation (6), $RD = D(y_i - x_{i1}b_1)_{\text{reduced model}} - D(y_i - x_{i1}b_1 - x_{i2}b_2)_{\text{full model}}$ and $\hat{\tau}$ is an estimate of a
scale parameter $\tau$, similar to the least squares parameter $\sigma$. The McKean-Hettmansperger chi-square statistic (MHCHI) is then
compared to a chi-square value with $q_2$ degrees of freedom. Computer simulation studies by Hettmansperger and McKean
(1977) and Draper (1981) suggested that the test statistic in equation (6) be modified to the form

$\mathrm{MHF} = (RD/q_2)/(\hat{\tau}/2)$ (7)

and compared to an F critical value with $q_2$ and N - q - 1 degrees of freedom. The McKean-Hettmansperger F (MHF) and
MHCHI tests are consistent, asymptotically distribution-free, and have the same A.R.E. under normality as the PSAR test
(McKean & Hettmansperger, 1998, pp. 175-178).
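
The arithmetic linking the two statistics can be made concrete with a small helper. This is a sketch only: the name `mh_tests` and its arguments are hypothetical, and the two dispersion values and the scale estimate are assumed to come from separate rank-based fits of the reduced and full models.

```python
def mh_tests(d_reduced, d_full, tau_hat, q2):
    """Return (RD, MHCHI, MHF) from fitted dispersions and a scale estimate."""
    rd = d_reduced - d_full            # reduction in residual dispersion
    mhchi = rd / (tau_hat / 2.0)       # referred to chi-square with q2 df
    mhf = (rd / q2) / (tau_hat / 2.0)  # referred to F with q2 and N - q - 1 df
    return rd, mhchi, mhf
```

For instance, made-up inputs of 30.0 and 24.0 for the dispersions, a scale estimate of 2.0, and two parameters of interest give RD = 6, MHCHI = 6, and MHF = 3, showing that MHF is simply MHCHI divided by the number of parameters tested.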

A key feature of the tests in equations (6) and (7) is that they require estimates of $\tau$. The parameter $\tau$ is used in
several settings in nonparametric procedures, for example, in efficiency evaluations and to rescale tests so that they follow a
known distribution (Hettmansperger, 1984, p. 244). There are various ways to estimate $\tau$ (Revesz, 1984), but we focus on
the two that are available in the RREGRESS command in Minitab (Minitab Inc., 2000). One is a Lehmann-type estimator
based on the standardized length of a 90% Wilcoxon confidence interval (McKean & Hettmansperger, 1976). The other
method used to estimate $\tau$ in RREGRESS is based on kernel estimation. (Details of how $\tau$ was estimated in our simulation
study appear in Appendix A.) Unfortunately, there is currently no widely available software that will perform the MHCHI or
MHF tests, but the RD statistic can be calculated with existing regression software in programs such as Minitab (Minitab
Inc., 2000) and SPSS (SPSS Inc., 1999).

Research for the McKean-Hettmansperger Tests 

Several computer simulation studies of the behavior of the MHF test have been done but apparently all have been 
quite limited in scope, almost always involving a focus on Type I error rates for a few conditions in a two-factor design. 
McKean and Sievers (1989) reported that the MHF test maintained its Type I error rate near the nominal value (with one 
exception) in an unbalanced 3x3 design with interaction for two heavy-tailed distributions (logistic, log-Pareto). 
Hettmansperger and McKean (1977) examined the MHF test for a balanced 3x3 design, small cell sample sizes (3, 5), and a 
double-exponential distribution, and reported that the estimated Type I error rates were consistently inflated for small 
samples (5-10). Hettmansperger and McKean (1983) performed a computer simulation study that used MHF to test for 
parallelism of slopes for three groups, sample sizes of 5 or 10, and double-exponential and Cauchy distributions. The MHF 
test maintained its Type I error rate near the nominal value for all conditions examined, and showed good power compared to 
some nonparametric competitors. Other studies of the MHF test include McKean and Hettmansperger (1978), Sievers and
McKean (1986), McKean and Sheather (1991), McKean, Vidmar, and Sievers (1989), and
Hettmansperger and McKean (1977). The general finding from these studies was that the MHF test maintained its Type I
error rate unless cell sample sizes were quite small, and that the test frequently performed well for heavy-tailed distributions.

Aligned-Rank-Transform Test

The rank-transform procedure introduced by Conover and Iman (1976) requires that scores be ranked and submitted to 
a parametric test. This procedure is known to work well for tests in simple designs but less well for many complex designs, 
for example, factorial designs (Akritas, 1990). Salter and Fawcett (1984) provided a possible solution to this problem for a 
randomized block design by suggesting that the rank-transform be applied to data that have been aligned for nuisance 
parameters, producing an aligned-rank-transform test. Akritas (1991) pushed this notion further by suggesting that an 
aligned-rank-transform test could be applied to subhypotheses of the general linear model. The aligned-rank-transform test 
is the same as the PSAR test except that the test statistic is divided by $q_2$ to produce an (approximate) F that is then compared
to a critical F value with $q_2$ and N - q - 1 degrees of freedom. The test is easily computed with available data analysis
software.



Research for the Aligned-Rank-Transform Test 

Mansouri and Chang (1995) provided theoretical results for the ART test and also reported simulation results for a
factorial design that showed the test controlled its Type I error rate at $\alpha$. Mansouri (1998) provided the limiting distribution
and A.R.E. of an ART test for a balanced incomplete blocks design. Kepner and Wackerly (1996) provided A.R.E. 
comparisons for an ART test for a balanced incomplete repeated measures design using Wilcoxon rank scores, and showed 
that the test was particularly attractive for heavy-tailed distributions. Akritas (1993) presented an aligned-rank-transform 
test that can be used when data are heteroscedastic. 

Salter and Fawcett (1984) performed a computer simulation study that provided evidence that the F distribution could 
provide a satisfactory approximation to the distribution of their aligned-rank-transform (ART) test in a randomized block 
design. Salter and Fawcett (1993) studied the ART test for a completely between-subjects factorial design with varying 
sample sizes and distributions, and found that the test controlled its Type I error rate and showed good power for cell sample 
sizes greater than 10. Other studies of the ART test have been reported by Groggel (1987), Harwell and Serlin (1994), and
Gorham (1998). The ART test frequently produced conservative Type I error rates for small samples, with corresponding 
low power; for larger sample sizes, however, the test typically performed well. 

These results suggest that an ART test may be an important competitor to the PSAR and MHCHI/MHF tests. 
However, there do not appear to be any studies available of an ART test in hierarchical regression. 

Before continuing, it is important to point out that the statistical hypotheses tested by the various nonparametric tests 
may differ from one data analysis to another. Recall that the PSAR and MHCHI/MHF tests test the hypothesis $H_0: \mathbf{B}_2 = \mathbf{0}$.
Akritas and Arnold (1994) pointed out that rejection of this $H_0$ does not necessarily imply that the slopes do not equal zero
unless additional assumptions are imposed on the data to ensure that rejection is attributable to nonzero slopes and not to 
other distributional characteristics such as scale, skewness, and/or kurtosis. 

Akritas and Arnold argued that nonparametric hypotheses should be defined in a way that does not place additional 
assumptions on the data, such as by writing $H_0$ in terms of distribution functions. For example, they would replace
$H_0: \mathbf{B}_2 = \mathbf{0}$ with $H_0: F_i(y) = F(y_i \mid \mathbf{x}_i - \bar{\mathbf{x}})$, with the latter $H_0$ described as fully nonparametric because it does not directly
depend on any parameters (Akritas & Arnold, 1994). Under the null hypothesis, $H_0: \mathbf{B}_2 = \mathbf{0}$ and $H_0: F_i(y) = F(y_i \mid \mathbf{x}_i - \bar{\mathbf{x}})$ are
identical but under $H_1$ they differ because the fully nonparametric version does not directly attribute the rejection to nonzero
slope parameters. Following the lead of Puri and Sen (1971, 1985), Adichie (1978), Hettmansperger (1984), and others we
assume that all of the nonparametric tests test $H_0: \mathbf{B}_2 = \mathbf{0}$. Clearly, the data must be examined for evidence supporting the
additional assumptions associated with writing $H_0$ in this fashion (Hettmansperger & McKean, 1998, p. 234).

Serlin-Harwell Aligned-Rank Procedure (SHARP) 

An examination of the strategies behind the PSAR, MHCHI, MHF, and ART tests suggests other nonparametric tests 
that can be constructed to examine the effects of a set of variables after the contribution of other variables has been removed. 
Recall that in the PSAR test the first step is to create residuals, which are free of the effects of the nuisance variables, and in 
the McKean-Hettmansperger procedure a function of the ranks of residuals from the full and reduced models are compared. 
We consider a marriage of these strategies. 

Suppose we wish to examine the effects of the predictors $x_3$-$x_4$ on y after the effects of $x_1$-$x_2$ are taken into account.
Suppose also that $x_1$-$x_2$ were used to predict y, the residuals obtained and ranked, and a sum of squares regression
(SSReg reduced model) obtained by using $x_1$-$x_2$ to predict the ranked residuals. Computing $SSReg_{\text{reduced model}}$/SSTotal then
produces $R^2_{\text{reduced model}}$. Next we predict the ranked residuals using $x_1$-$x_4$ and compute $R^2_{\text{full model}}$. The hypothesis $H_0: \mathbf{B}_2 = \mathbf{0}$
can be tested using

$\mathrm{SHARPCHI} = (N - q_1 - 1)\left[(R^2_{\text{full model}} - R^2_{\text{reduced model}})/(1 - R^2_{\text{reduced model}})\right] \sim \chi^2_{q_2}$ (8)

The $(N - q_1 - 1)$ are the degrees of freedom associated with the sum of squares left over after the ranked residuals are predicted
from the reduced model. Dividing SHARPCHI by $q_2$ produces a statistic with an F distribution with $q_2$ and N - q - 1 degrees of
freedom (SHARPF). We note that Mansouri (1996) proposed a similar test of the form
$q_2[(SSError_{\text{reduced model}} - SSError_{\text{full model}})/MSError_{\text{full model}}]$, which follows a chi-square distribution with $q_2$ degrees of freedom.
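
The SHARP computations described above can be sketched as follows. This is a hedged illustration rather than the authors' code: the names `r_squared` and `sharp_tests` are hypothetical, residuals come from an OLS fit of the reduced model, and the simple ranking used here assumes no tied residuals.

```python
import numpy as np

def r_squared(predictors, response):
    """Squared multiple correlation from an OLS fit with an intercept."""
    n = len(response)
    design = np.column_stack([np.ones(n), predictors])
    coef, *_ = np.linalg.lstsq(design, response, rcond=None)
    resid = response - design @ coef
    centered = response - response.mean()
    return 1.0 - (resid @ resid) / (centered @ centered)

def sharp_tests(y, x_reduced, x_extra):
    """Return (SHARPCHI, SHARPF) for the predictors in x_extra."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    x1 = np.asarray(x_reduced, dtype=float).reshape(n, -1)
    x2 = np.asarray(x_extra, dtype=float).reshape(n, -1)
    q1, q2 = x1.shape[1], x2.shape[1]
    # Residuals of y from the reduced model, then ranked (no ties assumed).
    design = np.column_stack([np.ones(n), x1])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    ranks = np.argsort(np.argsort(y - design @ coef)) + 1.0
    # R-squared of the ranked residuals under the reduced and full models.
    r2_reduced = r_squared(x1, ranks)
    r2_full = r_squared(np.column_stack([x1, x2]), ranks)
    sharpchi = (n - q1 - 1) * (r2_full - r2_reduced) / (1.0 - r2_reduced)
    return sharpchi, sharpchi / q2
```

The first returned value would be referred to a chi-square distribution and the second to an F distribution, with the degrees of freedom given in the text.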

The various tests have many similarities. The PSAR, MHCHI, and SHARPCHI tests are based on a quadratic form in
the ranks that is asymptotically distributed as a chi-square variable, and can easily be converted to F-tests. In addition, the
tests produce the same or similar results when applied to data from simpler experimental designs (e.g., a single-factor
design); however, for more complex designs there may be sharp differences in their behavior. There may also be differences 
in their behavior for small sample sizes and particular distributional forms. 

Simulation Study 

As indicated previously, literature on the performance of general linear model-based nonparametric tests has largely 
been limited to analyses based on factorial designs. The Type I error and power performance of the PSAR, MHCHI, MHF, 
ART, SHARPCHI, and SHARPF tests were examined for realistic data conditions for the hierarchical regression model. We
chose hierarchical regression to compare the nonparametric tests for two reasons. First, this procedure has been regularly
used in educational research, and, second, this study will add to the nonparametrics literature because there is apparently no
evidence of the behavior of these tests for the hierarchical regression model.

We used a computer simulation study to generate data for a hierarchical regression model, which in turn were used to
examine the Type I error and power behavior of the various tests. Although an analytic approach to the behavior of these 
tests is preferred because of its generalizability, available theoretical work for the nonparametric general linear model-based 
tests investigated in this study, when such results exist, assumes that quite large sample sizes are present. Since large sample 
sizes may not occur in practice, it is important to study a test’s behavior under realistic conditions, such as small sample sizes 
and different distributions (Draper, 1988). We used traditional rank and Wilcoxon scores for the various nonparametric 
tests, although we acknowledge that choice of scoring functions is important because different score functions give rise to 
estimators and tests with different properties (Draper, 1988; Naranjo & McKean, 1997; Policello & Hettmansperger, 1976). 

The hierarchical regression model in the simulation had a total of four predictors: $x_1$-$x_2$ represented the reduced model
and $x_1$-$x_4$ the full model. The statistical null hypothesis tested was $H_0: \mathbf{B}_2 = \mathbf{0}$. The design factors of the simulation study
were (a) distribution of the residuals (normal with skewness ($\gamma_1$) and kurtosis ($\gamma_2$) of 0; chi-square with 8 df ($\gamma_1 = 1$, $\gamma_2 = 1.5$);
chi-square with 4 df ($\gamma_1 = 1.41$, $\gamma_2 = 3$); approximate Cauchy ($\gamma_1 = 0$, $\gamma_2 = 25$)); (b) sample size (N = 20, 40, 60, or
80, representing sample size to number of predictor ratios of 5:1, 10:1, 15:1, and 20:1, respectively, for Type I error rate runs;
N = 20, 60 for power runs); and (c) $\rho_{x_1x_2} = \rho_{x_3x_4}$ = 0.0 or 0.3. In all cases the residuals were homoscedastic and
$\rho_{x_1x_3} = \rho_{x_1x_4} = \rho_{x_2x_3} = \rho_{x_2x_4} = 0.3$.

The selection of sample sizes and distributions was made based on conditions observed in the educational research 
literature. The chi-square distributions with 8 and 4 degrees of freedom represent increasingly skewed and kurtic data, 
whereas the approximate Cauchy represents an extremely heavy-tailed distribution and is important to examine because of
theoretical and empirical evidence that rank-based tests are often superior to parametric tests for such distributions. The
$\rho_{x_1x_2} = \rho_{x_3x_4}$ correlation of 0 represents the ideal case of no overlapping variation among predictors, whereas
$\rho_{x_1x_3} = \rho_{x_1x_4} = \rho_{x_2x_3} = \rho_{x_2x_4} = .30$ means there was some shared variation among predictors but not enough to raise concerns about
collinearity. (The same logic guided our selection of $\rho_{x_1x_3} = \rho_{x_1x_4} = \rho_{x_2x_3} = \rho_{x_2x_4} = .30$.) Assuming a normal distribution,
the data were generated such that the $x_1$-$x_2$ predictors accounted for 20% of the variance in y; the remaining predictors ($x_3$-$x_4$)
were added to the model and accounted for differing amounts of additional variance in y (expressed through correlations
$\rho_{yx_3} = \rho_{yx_4}$) needed to achieve a theoretical power of .70 for varying sample sizes. When the partial correlation $R_{yx_3x_4 \cdot x_1x_2}$
equaled 0, rejections of $H_0: \mathbf{B}_2 = \mathbf{0}$ counted toward the estimated Type I error rate; for the non-zero case rejections counted
towards the estimated power. 

Data Generation 

The following steps were taken to generate data with the desired characteristics. First, 5*N scores following a
multivariate-normal distribution were generated using the Kaiser and Dickman (1962) procedure. These values were then
transformed to the various nonnormal distributions following the Vale and Maurelli (1983) procedure, which combines the
Kaiser and Dickman (1962) and Fleishman (1978) procedures. Evidence of the success of the data generation for various
distributions is provided in Table 1 for two representative sample sizes (N = 20, 60). Details on the computation of
correlations among the $x_k$ and y variables used to generate estimated Type I error rates and power values are given in
Appendix B.
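
The first generation step can be sketched in Python. This is only an illustration under stated assumptions, not the simulation program itself: it factors the target correlation matrix with a Cholesky decomposition in place of the principal-components factoring of Kaiser and Dickman (1962), and it omits the Vale and Maurelli (1983) transformation to nonnormality. The function names and the example matrix (the null case with $\rho_{x_1x_2} = \rho_{x_3x_4} = 0$) are chosen for illustration.

```python
import math
import random


def cholesky(matrix):
    """Return the lower-triangular factor L such that L L^T equals matrix."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - partial)
            else:
                lower[i][j] = (matrix[i][j] - partial) / lower[j][j]
    return lower


def correlated_normals(corr, n_cases, rng):
    """Generate n_cases rows of standard normal scores with correlation corr."""
    lower = cholesky(corr)
    size = len(corr)
    rows = []
    for _ in range(n_cases):
        z = [rng.gauss(0.0, 1.0) for _ in range(size)]
        rows.append([sum(lower[i][k] * z[k] for k in range(i + 1))
                     for i in range(size)])
    return rows


# Illustrative target matrix for (y, x1, x2, x3, x4) when H0 is true.
A = math.sqrt(0.1)
B = 0.6 * A
CORR = [
    [1.0, A, A, B, B],
    [A, 1.0, 0.0, 0.3, 0.3],
    [A, 0.0, 1.0, 0.3, 0.3],
    [B, 0.3, 0.3, 1.0, 0.0],
    [B, 0.3, 0.3, 0.0, 1.0],
]

scores = correlated_normals(CORR, 20, random.Random(1))  # 20 cases x 5 variables
```

Any factor F with FF' equal to the target matrix yields the same population correlations, which is why the Cholesky substitution is adequate for a sketch.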

Overall, the design of the simulation involved 4 (distributions) x 4 (sample sizes for Type I error rate runs) x 2
(correlations within pairs of predictors) + 4 (distributions) x 2 (sample sizes for power runs) x 2 (correlations within pairs of
predictors) = 48 conditions. Twenty thousand replications per condition were used to estimate the Type I error rate and
power of the tests, ensuring that estimates showed little sampling error. For each simulated dataset, the PSAR, MHCHI,
MHF, ART, SHARPCHI, SHARPF, and parametric F-tests were calculated and compared to the appropriate critical value
for the null case, producing a proportion of rejections ($\hat{\alpha}$) for each condition studied (in all cases the nominal Type I error
rate was .05). Both the confidence interval and window methods for estimating $\tau$ described in Appendix A were used. As a
result, each McKean-Hettmansperger test was computed using a confidence interval-based estimate as well as a
window-based estimate. A similar process was followed for the power case ($1 - \hat{\beta}$). The simulation program was written in
Microsoft Fortran 4.0.

Results 

Type I Error Results 

Summaries of the Type I error performance of the tests are displayed in Table 2. The estimated Type I error rates of
the McKean-Hettmansperger tests using window estimates of $\tau$ were very similar to those produced by the tests based on
confidence interval estimates, and only the latter are presented. Table 2 shows that the MHF test using a confidence interval
estimate for $\tau$ (MHFCI) produced an average estimated Type I error rate closest to the nominal value of .05 (.0517). Next
closest was the SHARPCHI test with an estimated error rate of .0535, followed, in order, by the PSAR (.0449), SHARPF (.0392),
MHCHI test using a confidence interval estimate for $\tau$ (MHCHICI, .0662), ART (.0334), and parametric F (.0675)
tests. The MHFCI, SHARPCHI, and PSAR tests overall provided satisfactory control of Type I error rates, with
the remaining tests providing somewhat less control. As expected, for a normal distribution the average Type I error rate of
the parametric F-test was closest to the nominal value (.0502), followed by the MHFCI (.0468), SHARPCHI (.0450),
MHCHICI (.0612), SHARPF (.0329), PSAR (.0304), and ART (.0212) tests.

Table 3 reports the estimated Type I error rates for each test by distribution and sample size. (Evidence described
below indicated that the correlation within pairs of predictors of 0.0 or .30 did not have much effect on error rates, and this
variable was not included in Table 3.) Many of the estimated error rates were reasonably close to .05 and indicated that
several tests often showed adequate control of error rates. However, for purposes of recommending one or more of these
tests, we further characterized their control of Type I error rates through the use of four categories. Estimated $\hat{\alpha}$ values
within the range .05 ± 2SD, where $SD = [\alpha(1-\alpha)/20{,}000]^{1/2}$ (.0470 to .0530), were considered to represent excellent control;
values between .05 ± 2-3 SD (.0453 to .0469, .0531 to .0546) were characterized as mildly inflated or conservative error
rates that represented good error rate control (identified in Table 3 with an *); values between .05 ± 3-4 SD (.0438 to .0452,
.0547 to .0561) were characterized as mildly to moderately inflated or conservative error rates that represented adequate
control and are indicated by a +; values outside .05 ± 4 SD (< .0438 or > .0561) represented more pronounced inflation or
overly conservative values that reflected unsatisfactory control of error rates and are indicated by a &. These categories
allowed us to further discriminate tests showing excellent control of Type I error rates from those showing less satisfactory
control. (Recall that the use of 20,000 replications means that each estimated error rate in Table 3 should be close to that
test's "true" $\alpha$ for the conditions studied.)
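
The four categories can be expressed as a short Python sketch. The function names and category labels are illustrative assumptions. The code uses the unrounded standard deviation, so an estimate lying exactly on one of the four-decimal boundaries reported above may be classified differently than in Table 3.

```python
import math


def monte_carlo_sd(alpha=0.05, replications=20000):
    """Standard deviation of an estimated rejection rate under the nominal alpha."""
    return math.sqrt(alpha * (1.0 - alpha) / replications)


def classify_error_rate(estimate, alpha=0.05, replications=20000):
    """Label an estimated Type I error rate by its distance from alpha in SD units."""
    distance = abs(estimate - alpha) / monte_carlo_sd(alpha, replications)
    if distance <= 2.0:
        return "excellent"
    if distance <= 3.0:
        return "good"
    if distance <= 4.0:
        return "adequate"
    return "unsatisfactory"


# Average error rates reported in the text for four of the tests.
for name, rate in [("MHFCI", 0.0517), ("SHARPCHI", 0.0535),
                   ("PSAR", 0.0449), ("F", 0.0675)]:
    print(name, classify_error_rate(rate))
```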

An examination of Table 3 shows that for a normal distribution the F-test, as expected, produced $\hat{\alpha}$ values close to .05
regardless of sample size, providing further evidence of the credibility of the simulation. Across the full set of conditions
reported in Table 3, the PSAR, ART, MHCHICI, F, and SHARPF tests showed unsatisfactory control of error rates for more
than half of the conditions studied, with the ART and PSAR tests consistently producing overly conservative values and the
others inflated values. Unlike some previous simulation studies, the McKean-Hettmansperger tests were sensitive to the
Cauchy distribution. The MHFCI test, on the other hand, did a good job of controlling Type I error rates, followed by the
SHARPCHI test, which showed adequate control except for the Cauchy distribution. Still, the number of independent
variables in the simulation design means that complex effects in the estimated error rates, such as whether the relationship
between error rates and distribution depends on sample size, may be present but are not immediately discernible in these
data. We next attempted to tease out this information.

Following the advice of Hoaglin and Andrews (1975) to analyze data from simulation studies for evidence of
important patterns, we fitted three-way ANOVA models to the estimated Type I error rates for each test to determine which
effects appeared to be the largest and whether interactions needed to be considered in the interpretation. We first
transformed the $\hat{\alpha}$ values using an arcsine transformation described in Marascuilo and McSweeney (1977, pp. 147-148) that
produces values whose mean and variance are independent of one another, and whose sampling distribution is quickly
approximated by a normal distribution. Because the design was unreplicated we did not model the three-way interaction,
which allowed within-cell variance estimates to be obtained. Model-checking revealed no strong departures from normality
or homogeneity of variance. Each effect was tested using a nominal Type I error rate of .05.
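
As a brief sketch of the transformation step, the following Python function applies the commonly used variance-stabilizing form $\phi = 2\arcsin(\hat{\alpha}^{1/2})$. That this is the exact variant given on the cited pages is an assumption, and the function name is illustrative.

```python
import math


def arcsine_transform(proportion):
    """Variance-stabilizing arcsine transformation of a proportion in [0, 1]."""
    if not 0.0 <= proportion <= 1.0:
        raise ValueError("proportion must lie between 0 and 1")
    return 2.0 * math.asin(math.sqrt(proportion))


# Transform a few estimated error rates before fitting the ANOVA models.
transformed = [arcsine_transform(rate) for rate in (0.0334, 0.0517, 0.0675)]
```

For a proportion based on n replications, the variance of the transformed value is approximately 1/n regardless of the underlying rate, which is the property that makes the transformed values suitable for ANOVA.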

The ANOVA results are summarized in Table 4 in the form of $\hat{\eta}^2$ statistics, defined as the sum of squares of a
statistically significant effect over the sum of squares total. We also computed $\hat{\omega}^2$ statistics (Hays, 1973), which are less
biased than $\hat{\eta}^2$ statistics as measures of effect size. The $\hat{\omega}^2$ statistics were similar to the $\hat{\eta}^2$ values, however, and only the
latter are reported in Table 4. The majority of the two-way interactions were not statistically significant, and among the
handful that were the associated $\hat{\eta}^2$ never exceeded .048. As a result, our focus was on the main effects, and Table 4 reports
$\hat{\eta}^2$ values for the Distribution, Correlation Within Pairs of Predictors ($x_1$-$x_2$, $x_3$-$x_4$), and Sample Size effects. The small
$\hat{\eta}^2_{Within-Pair}$ values provide a rationale for not including this factor in reporting the estimated Type I error rates in Table 3.
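
The two effect size measures can be computed from an ANOVA table as in the sketch below. The $\hat{\omega}^2$ function uses the usual fixed-effects formula, which is assumed here to match the version in Hays (1973); the function names and the numbers in the example are hypothetical and are not taken from the simulation.

```python
def eta_squared(ss_effect, ss_total):
    """Proportion of the total sum of squares attributable to an effect."""
    return ss_effect / ss_total


def omega_squared(ss_effect, df_effect, ms_error, ss_total):
    """Less biased effect size estimate for a fixed effect."""
    return (ss_effect - df_effect * ms_error) / (ss_total + ms_error)


# Hypothetical ANOVA quantities for one effect.
eta = eta_squared(ss_effect=30.0, ss_total=100.0)
omega = omega_squared(ss_effect=30.0, df_effect=3, ms_error=2.0, ss_total=100.0)
```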

Table 4 indicates that the MHFCI (.83) test was quite sensitive to the underlying distribution, along with the PSAR
test (.47). The SHARPF (.08) test, on the other hand, showed noticeably less sensitivity to distribution. The estimated error
rates of several tests (SHARPF, MHCHICI, ART, SHARPCHI) were also sensitive to sample size, which was expected
because several of the tests have been shown to have an asymptotic error rate of $\alpha$. The fact that the PSAR (.20) test was
less sensitive to sample size was unexpected.

The sensitivity of the tests to sample size was explored further by re-running the ANOVAs with Sample Size
restricted to the N = 60 or 80 cases, which should reflect the asymptotic behavior of $\hat{\alpha}$ more than N = 20 or 40. That is,
re-running the ANOVAs with N = 60 or 80 should shrink the $\hat{\eta}^2_{Sample\ Size}$ values of the tests. The values in parentheses in Table
4 for $\hat{\eta}^2_{Sample\ Size}$ are for the N = 60 or 80 case and provide evidence that the tests behaved as expected theoretically, with all
nonparametric tests showing substantial to huge decreases in $\hat{\eta}^2$ when using larger sample sizes.

In sum, the results reported in Tables 2-4 suggest that the parametric F-test be used for normally-distributed data, as 
predicted by theory. For the nonnormal distributions studied the performance of the MHFCI test was superior to the others, 
followed by the SHARPCHI test. The PSAR and ART tests, on the other hand, performed poorly for most conditions 
studied. The remaining tests showed a mixed Type I error rate pattern, doing well for some conditions and less well for
others.



Power Results 

Summaries of the overall power performance of the nine tests are displayed in Table 2. For a normal distribution the
average power values were MHCHICI (.7094), F (.6985), MHFCI (.6588), SHARPCHI (.6617), and SHARPF (.5653).
Because of the difficulty of interpreting power values when Type I error rates stray from the nominal value (.05), we focus
on conditions in which the tests did not show pronounced inflation ($\hat{\alpha}$ < .0562).

Table 5 reports individual power values (power rates associated with $\hat{\alpha}$ > .0562 are indicated with a &). For a normal
distribution, the estimated power values of .6984 and .6909 for the F-test for N = 20 and 60, respectively, provided further
evidence of the credibility of the simulation. For the moderately and strongly skewed/kurtic chi-square distributions the
MHFCI produced the highest power, with values fairly close to .70, followed by the SHARPCHI and SHARPF tests. The
ART and PSAR tests performed poorly, which is not surprising given their conservative Type I error rates for most
conditions. Factorial ANOVAs were not run for the power values because removing those values associated with an inflated
$\hat{\alpha}$ leaves a substantially unbalanced design.

Summary 

The simulation results suggest the following conclusions: (1) As predicted by theory, the parametric F-test should be
used for normally-distributed data regardless of sample size. (2) For the nonnormal distributions studied, the
McKean-Hettmansperger F-test using a confidence interval estimate of $\tau$ produced Type I error rates close to .05 and showed good
power even for smaller sample sizes, followed by the Serlin-Harwell aligned-rank chi-square test. (3) The performance of the
Serlin-Harwell aligned-rank F-test was mixed, while those of the Puri and Sen, aligned-rank-transform, and
McKean-Hettmansperger chi-square test with a confidence interval estimate were generally poor.

On balance, the top-performing test for the nonnormal distributions studied was the McKean-Hettmansperger F-test 
using a confidence interval estimate. The Serlin-Harwell aligned-rank chi-square test performed almost as well for many of 
the conditions studied, which, combined with the fact that it is easier to compute, makes it an attractive competitor to the 
McKean-Hettmansperger test. 





Future Research 



This study provides evidence of the behavior of several general linear model-based nonparametric tests in a
hierarchical regression analysis under realistic conditions. Future work might include studying the behavior of the tests for
other statistical procedures and conditions, such as factorial designs with heteroscedastic and nonnormal data. The result of
this work will be the development of a literature that will provide educational researchers with credible nonparametric
alternatives when analyzing data from complex research designs and data analyses.
Table 1

Evidence of the Success of the Data Generation

| Sample Size | Distribution | Mean | Variance | Skewness | Kurtosis | $r_{x_1x_2}$ ($\rho_{x_1x_2} = 0$) | $r_{x_1x_3}$ ($\rho_{x_1x_3} = .3$) |
|---|---|---|---|---|---|---|---|
| N = 20 | Normal ($\gamma_1 = 0$, $\gamma_2 = 0$) | -.0005 | 1.0006 | .0007 | .0018 | .0001 | .3011 |
| | $\chi^2$ (df = 8) ($\gamma_1 = 1$, $\gamma_2 = 1.5$) | -.0001 | 1.0007 | 1.002 | 1.515 | -.0022 | .3000 |
| | $\chi^2$ (df = 4) ($\gamma_1 = 1.41$, $\gamma_2 = 3$) | -.0001 | .9988 | 1.412 | 2.984 | -.0062 | .2981 |
| | Cauchy ($\gamma_1 = 0$, $\gamma_2 = 25$) | -.0006 | .9985 | -.0053 | 25.144 | -.0017 | .2998 |
| N = 60 | Normal | -.0004 | .9999 | .0008 | -.0019 | .0003 | .3001 |
| | $\chi^2$ (df = 8) | -.0001 | 1.001 | 1.000 | 1.503 | .0011 | .3005 |
| | $\chi^2$ (df = 4) | -.0003 | .9990 | 1.412 | 2.985 | .0001 | .2991 |
| | Cauchy | -.0017 | 1.001 | -.0168 | 24.794 | .0002 | .3005 |

Each tabled value represents an average. The mean, variance, skewness, and kurtosis are based on data for N cases for 5
variables (4 predictors and y) across 20,000 replications for both Type I error rate and power conditions. For example, the
first entry in the table of -.00054 is based on 20*5*20,000*2 = 4,000,000 scores. The results for the average correlation
$r_{x_3x_4}$ were similar to those for $r_{x_1x_2}$, and the results for $r_{x_1x_4}$ and $r_{x_2x_4}$ were similar to those for $r_{x_1x_3}$.
Summary of Estimated Type I Error and Power Values for the Tests
Table 4

Effect Size Estimates for ANOVAs of Estimated Type I Error Rates

| Test | $\hat{\eta}^2_{Distribution}$ | $\hat{\eta}^2_{Within-Pair}$ | $\hat{\eta}^2_{Sample\ Size}$ |
|---|---|---|---|
| F | .95 | .02 | < .01 (not sig.) |
| PSAR | .47 | not sig. | .20 (not sig.) |
| MHCHICI | .29 | .05 | .63 (.09) |
| MHFCI | .83 | .04 | not sig. (not sig.) |
| ART | .28 | not sig. | .49 (.02) |
| SHARPCHI | .46 | .04 | .40 (not sig.) |
| SHARPF | .08 | .01 | .88 (not sig.) |

F = parametric F-test, PSAR = Puri and Sen aligned-rank test, MHCHICI = McKean-Hettmansperger chi-square
test with confidence interval estimate of $\tau$, MHFCI = McKean-Hettmansperger F test with confidence interval
estimate, ART = Aligned-rank transform, SHARPCHI = Serlin-Harwell modified aligned-rank chi-square test,
SHARPF = Serlin-Harwell modified aligned-rank F-test. Values in parentheses for $\hat{\eta}^2_{Sample\ Size}$ are based on
restricting the sample size to N = 60 or 80.
Estimated Power Values by Distribution and Sample Size

Appendix A



Estimating $\tau$

In this section we briefly describe the $\tau$ scale parameter, its use in the McKean-Hettmansperger tests, and how it was
estimated in our simulation. Initially it was assumed that the residuals had density $f(\varepsilon)$ and that $\tau$ was the scale parameter
of this density. Unfortunately, $\tau$ cannot typically be estimated in a simple fashion because $f(\varepsilon)$ is unknown.

The essence of the method, which uses the standardized length of a 90% confidence interval to estimate $\tau$, is to create a
family of confidence intervals about a point in the center of the density. These confidence levels depend only on the
large-sample Wilcoxon signed-rank null distribution and not on the underlying distribution of the residuals. The lengths of these
confidence intervals are then used to estimate this parameter, say $\hat{\tau}_{CI}$. The choice of $1 - \alpha$ is crucial in estimating $\tau_{CI}$.
Hettmansperger and McKean (1983) reported that .90 was a suitable choice, McKean and Sievers (1989) suggested that .98
was a suitable choice, and McKean and Sheather (1991) indicated that larger $1 - \alpha$ values are needed when the ratio of
sample size to the number of parameters is less than 5; for ratios greater than 5, a $1 - \alpha$ of .80 appears to be sufficient.
RREGRESS uses .90 by default.

Following Aubuchon and Hettmansperger (1984), $\hat{\tau}_{CI} = N^{1/2}(L_{.95} - L_{.05})/(2\,t_{1-\alpha})$, where $L_{\alpha}$ is a cutpoint (described
below) and t is a critical t-value. First, obtain N residuals after fitting a full model and then compute the N(N+1)/2 pairwise
(Walsh) averages among the residuals. For the Wilcoxon signed-rank statistic T, $\mu_T = N(N+1)/4$ and
$\sigma^2_T = N(N+1)(2N+1)/24$. Next, find $C_{.05} = \mu_T - 1.645\,\sigma_T$, rounded down to the next lower integer. After sorting the
Walsh averages, the $(C_{.05}+1)$st $= L_{.05}$ and $(N(N+1)/2 - C_{.05})$th $= L_{.95}$ values are the confidence interval limits, and the
difference between the limits is the estimated confidence interval length, say $\hat{\delta}$. Hettmansperger and McKean (1977)
suggested an estimator of the form:

\[
\hat{\tau}_{CI} = \frac{N\hat{\delta}/2}{(N-q-1)^{1/2}\, t_{.95}} \tag{A1}
\]

where $t_{.95}$ is a t-value corresponding to the 95th percentile of a t-distribution with N-q-1 degrees of freedom and
$[N/(N-q-1)]^{1/2}$ is an adjustment factor. The rationale for the adjustment is that, without it, the statistic is biased because
the residuals are correlated, shrinking their variance. A drawback of $\hat{\tau}_{CI}$ is that this estimator is only consistent if the
residual distribution is symmetric.
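
The steps leading to equation (A1) can be sketched in Python as follows. This is an illustration of the description above rather than the RREGRESS implementation. The function names are assumptions, and the critical t-value is passed in by the caller (for example, 2.015 for 5 degrees of freedom) so that the sketch needs only the standard library.

```python
import math


def walsh_averages(residuals):
    """Sorted pairwise averages (r_i + r_j) / 2 for i <= j."""
    n = len(residuals)
    return sorted((residuals[i] + residuals[j]) / 2.0
                  for i in range(n) for j in range(i, n))


def lower_cutpoint(n, z=1.645):
    """C = mu_T - z * sigma_T, rounded down, for the signed-rank statistic."""
    mu = n * (n + 1) / 4.0
    sigma = math.sqrt(n * (n + 1) * (2 * n + 1) / 24.0)
    return math.floor(mu - z * sigma)


def tau_ci(residuals, q, t_crit, z=1.645):
    """Confidence-interval estimate of tau as in equation (A1).

    q is the number of predictors in the full model and t_crit is the
    95th percentile of a t-distribution with N - q - 1 degrees of freedom.
    """
    n = len(residuals)
    averages = walsh_averages(residuals)
    cut = lower_cutpoint(n, z)
    if cut < 0:
        raise ValueError("sample too small for the normal approximation")
    lower = averages[cut]                       # (C + 1)st smallest value
    upper = averages[len(averages) - cut - 1]   # (N(N+1)/2 - C)th smallest value
    delta = upper - lower
    return (n * delta / 2.0) / (math.sqrt(n - q - 1) * t_crit)
```

Because the estimate is built from differences of Walsh averages, it is unchanged when a constant is added to every residual and doubles when every residual is doubled, as a scale estimate should.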
A second method to estimate $\tau$ uses the kernel density estimation method described in Tapia and Thompson (1978),
Maritz (1995), and Hettmansperger (1984). Once again $\tau$ is estimated (say, $\hat{\tau}_{WIN}$) from the density $f(\varepsilon)$. The asymptotic
formula for $\tau_{WIN}$ is

\[
\tau_{WIN} = \frac{1}{(12)^{1/2}\int f^2(\varepsilon)\,d\varepsilon}
           = \frac{1}{(12)^{1/2}\int f(\varepsilon)\,dF(\varepsilon)} \tag{A2}
\]

where $\int f^2(\varepsilon)\,d\varepsilon$ = mean density and $F(\varepsilon)$ is a cumulative distribution function (cdf). Kernel estimation essentially uses
a histogram to estimate the density and requires that we use the data twice, once to estimate $f(\varepsilon_i)$, say $\hat{f}(\varepsilon_i)$, and once to
estimate an empirical cdf, say $\hat{F}(\varepsilon)$.

Based on the formula $\int f(\varepsilon)\,dF(\varepsilon)$ in equation (A2), Hettmansperger (1984) proposed replacing $f(\varepsilon)$ by a kernel
estimate and estimating $F(\varepsilon)$ with $\hat{F}(\varepsilon)$. Letting $\hat{\tau} = 1/[(12)^{1/2}\hat{\gamma}^*]$ ($(12)^{1/2}$ comes from McKean and Hettmansperger's
use of Wilcoxon scores), define

\[
\hat{\gamma}^* = \int \hat{f}(\varepsilon)\,d\hat{F}(\varepsilon) \tag{A3}
\]

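The density $f(\varepsilon_i)$ is estimated with a kernel applied to the residuals $r_i$, and substituting this estimate into (A3) gives

\[
\hat{\gamma}^* = N^{-2} h_N^{-1} \sum_{i=1}^{N}\sum_{j=1}^{N} w[(r_i - r_j)/h_N] \tag{A4}
\]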

where $w(\cdot)$ is a window, that is, a density that is symmetric about 0, and $h_N$ is the window width. This method requires that the
window function and the window width be selected. Many authors have pointed out that the choice of window function has
little impact on the results (e.g., Bean & Tsokos, 1980), and Minitab's RREGRESS command uses a uniform density. The
choice of the window width $h_N$, on the other hand, is crucial, and is similar to the choice of the confidence level in the
confidence interval method. Large window widths lead to density estimates with small variance but substantial bias,
whereas small window widths lead to density estimates with larger variance but smaller bias (Maritz, 1995, p. 28). Several
authors have argued that minimizing bias is the more important of the two (e.g., Aubuchon & Hettmansperger, 1984; Bean &
Tsokos, 1980).



Aubuchon and Hettmansperger (1984) modified equation (A4) to reduce bias, producing

\[
\hat{\gamma}^* = \frac{1}{N h_N} + \frac{1}{N(N-1)h_N}\sum_{i \ne j} w[(r_i - r_j)/h_N] \tag{A5}
\]

where $K = 4.1078\,S$ (S = sample interquartile range computed on the full model residuals), $h_N = K N^{-1/2}$, and
$\sum_{i \ne j} w[(r_i - r_j)/h_N]$ requires that we compute the N(N-1) possible pairwise differences among the full model residuals and
assign 1 if a pairwise difference divided by $h_N$ is between -.50 and .50, and 0 otherwise. (Aubuchon and Hettmansperger
(1984) showed that using equation (A5) rather than (A4) reduces the bias in estimating $\tau$ from $O(N^{-2/3})$ to $O(N^{-1})$.) The use
of equation (A5) requires specifying the scale of the underlying distribution as well as its shape. RREGRESS assumes that
the distribution is normal, which is where 4.1078 comes from (see Hettmansperger, 1984, p. 249). Finally,

\[
\hat{\tau}_{WIN} = 1/[(12)^{1/2}\hat{\gamma}^*] \tag{A6}
\]

An advantage of this method is that estimates of $\tau_{WIN}$ are consistent without assuming symmetry of the underlying
distribution. However, the statistic in equation (A6) is still biased because the residuals are correlated. Various corrections
have been proposed, and the RREGRESS command uses $[N/(N-q-1)]^{1/2}$. The confidence interval (CI) and window (WIN)
estimators are (asymptotically) equally accurate but show differences in small sample performance (Draper, 1988). We
computed both in our simulation study for each McKean and Hettmansperger test.
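
A rough Python sketch of the window estimator follows. It rests on several assumptions that should be checked against the cited sources: the first term of equation (A5) is taken to be $1/(N h_N)$, the window is uniform on [-.50, .50], the window width is $h_N = 4.1078\,S\,N^{-1/2}$, and the interquartile range S is computed with the default method of Python's `statistics.quantiles`, which may differ from the quartile rule used by RREGRESS. The function name is illustrative.

```python
import math
import statistics


def tau_window(residuals, q):
    """Window (kernel) estimate of tau following equations (A5) and (A6).

    q is the number of predictors in the full model. The result includes
    the [N / (N - q - 1)]^(1/2) correction for correlated residuals.
    """
    n = len(residuals)
    q1, _, q3 = statistics.quantiles(residuals, n=4)
    iqr = q3 - q1
    if iqr <= 0:
        raise ValueError("interquartile range must be positive")
    width = 4.1078 * iqr / math.sqrt(n)  # assumed h_N = K * N^(-1/2)
    # Count ordered pairs (i != j) whose scaled difference falls in the window.
    inside = sum(1 for i in range(n) for j in range(n)
                 if i != j and abs(residuals[i] - residuals[j]) / width <= 0.5)
    # Assumed reading of (A5): first term is 1 / (N * h_N).
    gamma = 1.0 / (n * width) + inside / (n * (n - 1) * width)
    tau = 1.0 / (math.sqrt(12.0) * gamma)
    return tau * math.sqrt(n / (n - q - 1))
```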
Appendix B

Generating Data To Satisfy the Type I Error and Power Conditions

H0 True



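The calculations were based on inverting the general correlation matrix

\[
P =
\begin{array}{c|ccccc}
    & y & x_1 & x_2 & x_3 & x_4 \\ \hline
y   & 1 & \rho_{yx_1} & \rho_{yx_2} & \rho_{yx_3} & \rho_{yx_4} \\
x_1 &   & 1 & \rho_{x_1x_2} & \rho_{x_1x_3} & \rho_{x_1x_4} \\
x_2 &   &   & 1 & \rho_{x_2x_3} & \rho_{x_2x_4} \\
x_3 &   &   &   & 1 & \rho_{x_3x_4} \\
x_4 &   &   &   &   & 1
\end{array}
\]
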
Here correlations with y appear in the first row and column, correlations for the reduced set of predictors ($x_1$-$x_2$) in rows and
columns 2 and 3, and correlations for the $x_3$-$x_4$ predictors in rows and columns 4 and 5. Inverting this matrix yields, in the
first row and column, the element $R^{-1}_{11} = 1/(1 - R^2)$, where $R^2$ is the squared multiple correlation in predicting y from the
other variables in the matrix. We began with the reduced model matrix with $\rho_{x_1x_2} = \rho_{x_3x_4} = 0$,

\[
P_{reduced} =
\begin{bmatrix}
1 & \rho_{yx_1} & \rho_{yx_2} \\
\rho_{yx_1} & 1 & 0 \\
\rho_{yx_2} & 0 & 1
\end{bmatrix}
\]

and the element $R^{-1}_{11}$ is $1/(1 - 2\rho^2_{yx_1})$. Setting $R^2 = .2$ and solving for $\rho_{yx_1}$ yields $\rho^2_{yx_1} = .1$ and $\rho_{yx_1} = (.1)^{1/2}$ (we drop the subscripts on $\rho$
involving x from here on since, for example, $\rho_{yx_1} = \rho_{yx_2}$). Thus, the correlation between y and $x_1$ and between y and $x_2$
when the correlation between $x_1$ and $x_2$ (and between $x_3$ and $x_4$) equals zero was $(.1)^{1/2} = .3162$. Repeating this process for
$\rho_{x_1x_2} = \rho_{x_3x_4} = .3$, the element $R^{-1}_{11}$ is $.91/(.91 - 1.4\rho^2_{yx_1})$. Solving, we get $\rho_{yx_1} = (.13)^{1/2}$, so the correlation between y and
$x_1$ and between y and $x_2$ when the correlation between $x_1$ and $x_2$ (and between $x_3$ and $x_4$) equals 0.3 was $(.13)^{1/2} = .3605$.

When the correlation between $x_1$ and $x_2$ (and between $x_3$ and $x_4$) equals zero and the reduced model accounts for
$R^2 = .2$, under the null hypothesis the full model must also account for $R^2 = .2$. For the matrix (assuming $\rho_{x_1x_2} = \rho_{x_3x_4} = 0$),

\[
\begin{array}{c|ccccc}
    & y & x_1 & x_2 & x_3 & x_4 \\ \hline
y   & 1 & .3162 & .3162 & \rho_{yx} & \rho_{yx} \\
x_1 & .3162 & 1 & 0 & .3 & .3 \\
x_2 & .3162 & 0 & 1 & .3 & .3 \\
x_3 & \rho_{yx} & .3 & .3 & 1 & 0 \\
x_4 & \rho_{yx} & .3 & .3 & 0 & 1
\end{array}
\]

the element $R^{-1}_{11}$ is

\[
\frac{.64}{.8(.64) - 2(\rho_{yx} - .6\sqrt{.1})^2},
\]

and so $R^2 = .2 + \frac{2}{.64}(\rho_{yx} - .6\sqrt{.1})^2$. For the case when H0 is
true, set $\rho_{yx} = (.6)(.3162)$. Repeating this for the case $\rho_{x_1x_2} = \rho_{x_3x_4} = .30$ produces the element

\[
R^{-1}_{11} = \frac{.91(.637)(.931)}{.728(.637)(.931) - 1.27(.91\rho_{yx} - .42\sqrt{.13})^2},
\]

and so

\[
R^2 = .2 + \frac{1.27}{.91(.637)(.931)}(.91\rho_{yx} - .42\sqrt{.13})^2.
\]

For H0 true set $\rho_{yx} = \frac{.42}{.91}\sqrt{.13} = .1664$.
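
The null-case correlations can be checked numerically with a short Python sketch that computes $R^2$ for y directly from a correlation matrix. The helper names are illustrative assumptions, and a small Gaussian elimination routine stands in for a matrix library so that the example is self-contained.

```python
import math


def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(aug[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def r_squared(corr):
    """R^2 for predicting the first variable from the remaining ones."""
    r_xy = corr[0][1:]
    r_xx = [row[1:] for row in corr[1:]]
    weights = solve(r_xx, r_xy)
    return sum(w * r for w, r in zip(weights, r_xy))


def full_matrix(rho_y12, rho_y34, rho_within, rho_between=0.3):
    """Correlation matrix for (y, x1, x2, x3, x4) in the simulation design."""
    a, b, w, c = rho_y12, rho_y34, rho_within, rho_between
    return [
        [1.0, a, a, b, b],
        [a, 1.0, w, c, c],
        [a, w, 1.0, c, c],
        [b, c, c, 1.0, w],
        [b, c, c, w, 1.0],
    ]


# Null case with uncorrelated pairs: R^2 should stay at the reduced-model value.
null_r2 = r_squared(full_matrix(math.sqrt(0.1), 0.6 * math.sqrt(0.1), 0.0))
```

The same function can be applied to the matrices for the power case to confirm the increase in $R^2$ implied by a given $\rho_{yx}$.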



H0 False

Assuming the $x_k$ are fixed, the above results can be used to calculate the correlations needed to generate a power of
0.7. In each of the $R^2$ formulae, we see that part of the change in $R^2$ depends on the correlation between y and $x_3$ and
between y and $x_4$. What is needed is to compute the change in $R^2$ (over the reduced model $R^2$ of 0.2) needed to
yield a power of 0.7 (the noncentrality parameter here is $N\Delta R^2/(1 - R^2)$). The NCSSCALC program (downloadable at
http://www.ncss.com/download.html) was used to calculate the required change in $R^2$; as a check, the results were
replicated using the G-Power program (downloadable at http://www.psycho.uni-duesseldorf.de/aap/projects/gpower/). The
above formulas were then used to solve for $\rho_{yx}$. For $\rho_{x_1x_2} = \rho_{x_3x_4} = 0$ and N = 20 the resulting correlation matrix was:



\[
\begin{bmatrix}
1 & \sqrt{.1} & \sqrt{.1} & .476386 & .476386 \\
\sqrt{.1} & 1 & 0 & 0.3 & 0.3 \\
\sqrt{.1} & 0 & 1 & 0.3 & 0.3 \\
.476386 & 0.3 & 0.3 & 1 & 0 \\
.476386 & 0.3 & 0.3 & 0 & 1
\end{bmatrix}
\]

However, an adjustment of $\rho_{yx}$ was needed because the $x_{jk}$ were random, meaning that the F follows a confluent
hypergeometric distribution. The result of the adjustment was that $\rho_{yx} = .476386$ was changed to .497 to generate the
desired power of 0.7 under these conditions. For $\rho_{x_1x_2} = \rho_{x_3x_4} = .30$ and N = 20, the resulting correlation matrix was:



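\[
\begin{bmatrix}
1 & \sqrt{.13} & \sqrt{.13} & .529404 & .529404 \\
\sqrt{.13} & 1 & 0.3 & 0.3 & 0.3 \\
\sqrt{.13} & 0.3 & 1 & 0.3 & 0.3 \\
.529404 & 0.3 & 0.3 & 1 & 0.3 \\
.529404 & 0.3 & 0.3 & 0.3 & 1
\end{bmatrix}
\]

The adjustment changed $\rho_{yx} = .529404$ to .559. For $\rho_{x_1x_2} = \rho_{x_3x_4} = .30$ and N = 60, the resulting correlation matrix was: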
\[
\begin{bmatrix}
1 & \sqrt{.13} & \sqrt{.13} & .387814 & .387814 \\
\sqrt{.13} & 1 & 0.3 & 0.3 & 0.3 \\
\sqrt{.13} & 0.3 & 1 & 0.3 & 0.3 \\
.387814 & 0.3 & 0.3 & 1 & 0.3 \\
.387814 & 0.3 & 0.3 & 0.3 & 1
\end{bmatrix}
\]

The adjustment changed $\rho_{yx} = .387814$ to .389.
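
For reference, the noncentrality parameter used in the fixed-predictor power calculation, $N\Delta R^2/(1 - R^2)$, can be computed as in the sketch below. The function name and the input values are hypothetical, and converting the result to power still requires a noncentral F routine such as those in the programs named above.

```python
def noncentrality(n, r2_full, r2_reduced):
    """Noncentrality parameter N * (change in R^2) / (1 - full-model R^2)."""
    if not 0.0 <= r2_reduced <= r2_full < 1.0:
        raise ValueError("require 0 <= r2_reduced <= r2_full < 1")
    return n * (r2_full - r2_reduced) / (1.0 - r2_full)


# Hypothetical values: reduced-model R^2 of .20 and full-model R^2 of .45.
lam = noncentrality(n=20, r2_full=0.45, r2_reduced=0.20)
```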